CitiBike program have tasked us with the responsibility of analyzing data that they have collected about rides in 2019. In order to identify patterns in the ride history data, we investigated the relationship between all given factors. We focused on the demographics of the riders as well as the how and when these riders tend to use the bikes. We believe these aspects could contribute to an increase in ridership and profit.
Our report provides findings that showcase how riders tend to be 39 years or younger, the majority of riders are subscribers, inclement weather negatively affects ride time, the popular commute times are during the morning and late afternoon/evenings and certain stations have a surplus or deficit of bikes. The report also includes recommendations for the CitiBike program based on findings in terms of marketing efforts, pricing strategies and managing surpluses and deficits.
citi %>% ggplot(aes(x=gender,fill=gender)) + geom_bar(alpha=.8) + theme_fivethirtyeight() + scale_fill_brewer(palette="Set2")+theme(legend.position="none")+ggtitle(expression(atop("Gender of Citi Bikers")))
When looking at the distribution of riders based on gender, it is clear that the majority of riders in the 2019 data set were males (68%). Females and unidentified genders only make up a combined 32% of riders in 2019.
citi %>% ggplot(aes(x=age_category,fill=age_category)) + geom_bar(alpha=.8)+theme_fivethirtyeight()+scale_fill_brewer(palette="Set2")+theme(legend.position="false")+ggtitle(expression(atop("Distribution by Age Group")))
In terms of the age distribution, it is clear that the majority of riders tend to come from younger age groups. Around 56.3% of riders in 2019 were 39 years and under, meaning only 43.7% of riders were 40 and over.
ggplot(data = citi, aes(x = age, y = distMiles)) + geom_point() + geom_smooth(method = 'lm') + ggtitle("Comparison of Age to Distance traveled") + theme_fivethirtyeight()
## `geom_smooth()` using formula 'y ~ x'
ggplot(citi) + geom_bar(aes(time, fill=gender)) + theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1)) + ggtitle(expression(atop("Commute Times by Gender"))) + scale_fill_manual(values = c("lightgreen", "lightblue","pink")) + scale_fill_brewer(palette="Set2") + theme_fivethirtyeight() #+transition_states(states = time)+enter_fade() + shadow_mark()
The most popular times for rides seem to occur during the morning commute time of 7am-10am as well as the late afternoon / evening commute of 4pm to 7pm. Using information from the previous graph that showcases how majority of riders are subscribers that utilize the bikes more heavily during the weekdays, we can see that these time gaps correlate to working hours. Thus, subscribers usually take the bikes to get to work in the morning and then to get back home in the late afternoons/evenings. This information could be useful for CitiBike to use to make sure bikes are available for use during the most frequented times of day.
ggplot(avg_speed_age) + geom_line(aes(as.numeric(age), mean_speed, color = gender)) + labs(title = "Average Speed based on Age and Gender in 2019", x = "Age", y = "Average Speed (Miles Per Hour)") + theme(plot.title = element_text(hjust = 0.7)) + theme_fivethirtyeight() + scale_color_discrete(name = "Gender", labels = c("Female", "Male")) #+ transition_reveal(age)
ggplot(citi) + geom_bar(aes(x=weekday, y=(..count..)/sum(..count..), fill=usertype)) + ggtitle("Rides on days of week, by usertype") + theme_fivethirtyeight() + xlab("Weekday") + ylab("Percentage of All Rides")
ggplot(citi) + geom_bar(aes(x = hour, y=(..count..)/sum(..count..), fill = usertype)) + ylab("% of rides") + ggtitle("Rides per hour of the day, by usertype")
plot1 <- ggplot(top_start_count, aes(reorder(Var1,-Freq), Freq)) + geom_bar(stat="identity", fill = "lightblue") + geom_text(aes(label=Freq), vjust=-0.3, size=3.5) +
theme_minimal() + theme(axis.text.x = element_text(angle = 60, vjust = 0.5, hjust=0.5)) + ggtitle("Top 10 Start Stations") + xlab("Station Name") + ylab("Total Trips Started Out of station")
plot2 <- ggplot(top_end_count, aes(reorder(Var1,-Freq), Freq)) + geom_bar(stat="identity", fill = "lightblue") + geom_text(aes(label=Freq), vjust=-0.3, size=3.5) +
theme_minimal() + theme(axis.text.x = element_text(angle = 60, vjust = 0.5, hjust=0.5)) + ggtitle("Top 10 End Stations") + xlab("Station Name") + ylab("Total Trips Ended at Station")
grid.arrange(plot1, plot2, ncol= 2)
#start stations
leaflet(citi) %>%
addProviderTiles(providers$providers$CartoDB.DarkMatter) %>%
setView(lng = -73.98928, lat = 40.75042, zoom = 10) %>%
addMarkers(lng = citi$start.station.longitude, lat = citi$start.station.latitude,
popup = "Starting")
# #end stations
leaflet(citi) %>%
addProviderTiles(providers$providers$CartoDB.DarkMatter) %>%
setView(lng = -73.98928, lat = 40.75042, zoom = 10) %>%
addMarkers(lng = citi$end.station.longitude, lat = citi$end.station.latitude,
popup = "Ending")
#map by deficit (all stations)
leaflet(bike_deficit) %>%
addTiles() %>%
setView(-74, 40.75, zoom = 11.5) %>%
addCircleMarkers(lng = bike_deficit$longitude, lat = bike_deficit$latitude,
popup = paste(bike_deficit$station, "<br>", ifelse(bike_deficit$deficit>=0, "Bike deficit = ", "Bike surplus = "),
abs(bike_deficit$deficit)),
radius = abs(bike_deficit$deficit)/5, color = ifelse(bike_deficit$deficit>0, "red", "green"))
#top 5 surplus
bike_surplus_5 <-arrange(bike_deficit, (deficit))[1:5,]
popup_1 <- paste0("<b>", bike_surplus_5$station, "</b><br>",
"Deficit/Surplus: ", bike_surplus_5$deficit, "<br>",
"Arrival Count: ", bike_surplus_5$count_arrival, "<br>",
"Departure Count: ", bike_surplus_5$count_dep, "<br>")
leaflet() %>%
addTiles() %>%
addMarkers(lng = bike_surplus_5$longitude, lat = bike_surplus_5$latitude, popup = popup_1)
gt(bike_surplus_5) %>%
cols_label(
station = "Station",
latitude = "Latitude",
longitude = "Longitude",
count_dep = "Departures",
count_arrival = "Arrivals",
deficit = "Surplus")
| Station | Latitude | Longitude | Departures | Arrivals | Surplus |
|---|---|---|---|---|---|
| Pershing Square North | 40.7519 | -73.9777 | 1214 | 1347 | -133 |
| Broadway & E 22 St | 40.7403 | -73.9895 | 906 | 1009 | -103 |
| West St & Chambers St | 40.7176 | -74.0132 | 874 | 975 | -101 |
| Old Fulton St | 40.7028 | -73.9938 | 262 | 348 | -86 |
| E 2 St & 2 Ave | 40.7250 | -73.9907 | 413 | 490 | -77 |
#top 5 deficit
bike_deficit_5<-arrange(bike_deficit, -deficit)[1:5,]
popup_2 <- paste0("<b>", bike_deficit_5$station, "</b><br>",
"Deficit/Surplus: ", bike_deficit_5$surplus, "<br>",
"Arrival Count: ", bike_surplus_5$count_arrival, "<br>",
"Departure Count: ", bike_surplus_5$count_dep, "<br>")
leaflet() %>%
addTiles() %>%
addMarkers(lng = bike_deficit_5$longitude, lat = bike_deficit_5$latitude, popup = popup_2)
gt(bike_deficit_5) %>%
cols_label(
station = "Station",
latitude = "Latitude",
longitude = "Longitude",
count_dep = "Departures",
count_arrival = "Arrivals",
deficit = "Deficit")
| Station | Latitude | Longitude | Departures | Arrivals | Deficit |
|---|---|---|---|---|---|
| W 59 St & 10 Ave | 40.7705 | -73.9880 | 370 | 285 | 85 |
| Columbus Ave & W 72 St | 40.7771 | -73.9790 | 510 | 432 | 78 |
| E 32 St & Park Ave | 40.7457 | -73.9820 | 592 | 525 | 67 |
| 1 Ave & E 30 St | 40.7414 | -73.9754 | 392 | 326 | 66 |
| Carmine St & 6 Ave | 40.7304 | -74.0022 | 649 | 586 | 63 |
leaflet(citistations) %>%
addTiles() %>%
addCircleMarkers(lat = citistations$lat, lng = citistations$long, popup = citistations$name, radius = citistations$count/100, color = ifelse(citistations$usertype == "Subscriber", "blue", "red"))
leaflet(citistations_gender) %>%
addTiles() %>%
addCircleMarkers(lat = citistations_gender$lat, lng = citistations_gender$long, popup = citistations_gender$name, radius = citistations_gender$count/100, color = ifelse(citistations_gender$gender == "male", "blue", "red"))
ggplot(citi) + geom_histogram(aes(x = bikeid), stat= "count") + ggtitle("Count of rides per bikeid")
## Warning: Ignoring unknown parameters: binwidth, bins, pad
citi %>% group_by(bikeid) %>% summarise(total = sum(tripduration)) %>% ggplot(aes(reorder(bikeid, total), total)) + geom_col() + scale_y_log10(labels = comma) + ggtitle("Total Duration of Rides per Bike") + theme_fivethirtyeight() + theme(axis.title.x=element_blank(),
axis.text.x=element_blank(),
axis.ticks.x=element_blank()) + theme(axis.title = element_text()) + ylab("Duration (seconds)")
citi %>% group_by(bikeid) %>% summarise(mean = mean(speedMilesperHour)) %>% filter(mean > 0) %>% ggplot(aes(reorder(bikeid, mean), mean)) + geom_col() + ggtitle("Average Speed per Bike") + theme_fivethirtyeight() + theme(axis.title.x=element_blank(),
axis.text.x=element_blank(),
axis.ticks.x=element_blank()) + scale_y_continuous(labels = comma) + theme(axis.title = element_text()) + ylab("Miles per Hour")
ggplot(citi) + geom_histogram(aes(x = TAVG, fill = usertype)) + xlim(0,100) + ggtitle("Ride by Average Temperatures, by usertype") + theme_fivethirtyeight()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 4 rows containing missing values (geom_bar).
ggplot(data = citi, aes(y = SNOW, x = tripduration/60)) + geom_point(alpha = .1, color = "red") + labs(x = "duration") +labs(y = "snow accumulation") + scale_x_log10() + ggtitle("Duration of rides when snowing") + theme_fivethirtyeight() + theme(axis.title = element_text()) + ylab("Amount of snow falling") + xlab("Duration of ride (minutes)")
ggplot(data = citi, aes(y = PRCP, x = tripduration/60)) + geom_point(alpha = .1, color = "blue") + labs(x = "duration") +labs(y = "Precipitation accumulation") + scale_x_log10() + ggtitle("Duration of rides when raining") + theme_fivethirtyeight() + theme(axis.title = element_text()) + ylab("Amount of rain falling") + xlab("Duration of ride (minutes)")
ggplot(data = citi, aes(y = AWND, x = tripduration/60)) + geom_point(alpha = .1, color = "darkgreen") + labs(x = "duration") +labs(y = "Precipitation accumulation") + scale_x_log10() + ggtitle("Duration of rides by wind speed") + theme_fivethirtyeight() + theme(axis.title = element_text()) + ylab("Wind speed") + xlab("Duration of ride (minutes)")
Based on our findings, we recommend that CitiBike take the following actions:
In terms of their marketing efforts, Citibike should consider implementing demographic specific initiatives. They should focus their attention on increasing marketing efforts that target women and elderly individuals. They could also help attract more female user by emphasizing safety initiatives to remove deterrents. In addition, they can offer senior discounts to target the older demographic.
Target certain customer types based on day of week : Improve low ridership from customers on weekdays, and from subscribers on weekends by offering discounts on bikes (for customers) and ebikes (subscribers)
Vary pricing strategies based on inclement weather : Lower prices/offer discounts on rainy or snowy days & days when temp is below 50 degrees
Manage surpluses and deficits : Incentivize riders to drop off/pick up bikes from certain stations that tend to have extra/not enough bikes available